
Guidelines for Matching Records from Consecutive Months or Years of CPS Surveys

Background Information

The occupants of housing units in a CPS survey are interviewed for four consecutive months, excluded for eight months, then included in the survey for an additional four months. The month in sample (household level H-MIS) field, which ranges in value from 1 to 8, indicates which survey month each sampled person is in. Persons with H-MIS codes of 4 or 8 are said to be in outgoing rotation groups because they will be out of the sample the following month. Those with a value of 4 will be back in the survey after eight months, and at that time their H-MIS will be 5. Approximately one-quarter of a sample is in one of the two outgoing rotation groups. Thus, if you are matching respondents from consecutive months of CPS surveys, you will only be able to match less than three-quarters of the respondents. If you are matching respondents from the same month of two consecutive survey years, you will only be able to match less than one-half of the respondents (persons having H-MIS codes of 1-4 in the earlier year and 5-8 in the later year).
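The rotation arithmetic above can be checked with a short sketch. The Python below is illustrative only: the function names are hypothetical, and the resulting shares are upper bounds that assume the eight rotation groups are equally sized.

```python
def next_month_mis(mis):
    """Return the H-MIS a unit will have one month later,
    or None if it is in an outgoing rotation group (4 or 8)."""
    if mis in (4, 8):
        return None
    return mis + 1


def next_year_mis(mis):
    """Return the H-MIS a unit will have twelve months later,
    or None if it will no longer be in the sample."""
    # A unit in its first four months returns a year later in months 5-8.
    if 1 <= mis <= 4:
        return mis + 4
    return None


# Share of the eight H-MIS values that can appear in both files.
month_share = sum(next_month_mis(m) is not None for m in range(1, 9)) / 8
year_share = sum(next_year_mis(m) is not None for m in range(1, 9)) / 8
```

Here `month_share` is the three-quarters ceiling for consecutive months and `year_share` is the one-half ceiling for consecutive years; actual match rates are lower.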

In trying to match persons from one survey to the next, it is important to know that the CPS is a survey of addresses, not of households. Thus, if the occupants of a housing unit move and new residents occupy that unit in subsequent months, the same household and person identifiers used for the former residents will be assigned to the new residents. H-HHNUM, which the Census documentation refers to as "household number" but which might be better described as "household occupant identifier," may be used to determine whether the occupants of a structure are the same in both surveys. It ranges in value from 1 to 8. It is set to 1 for the first occupants of a sample unit and remains 1 unless the original occupants move, in which case any subsequent occupants are numbered consecutively.

It is also important to know that household sequence number cannot be used to match records from one month to another. Household sequence number is unique within a single file, but there is no correspondence between sequence numbers on one file and sequence numbers on any other file. Instead, scrambled household identifier (a 12- or 15-character field, depending on year, named H-IDNUM or HRHHID, depending on source) must be used for matching. Census constructs scrambled household identifier in part from geographical location, e.g., metropolitan statistical area residence status code (GEMETSTA) or another similar code, and a random number. The important thing to know about it is that it is not necessarily unique. It must be used in conjunction with H-MIS, H-HHNUM, and LINENO for longitudinal matches. Even then, there will be duplicate records on these four match fields, as well as match failures. Other fields such as race and gender must also be used to eliminate false positive matches.
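Because the four match fields do not guarantee uniqueness, it can help to count duplicate keys before matching. The following is a minimal sketch, assuming each person record is a Python dict keyed by the field names used above; the helper names and the sample values are made up for illustration.

```python
from collections import Counter

# The four fields named in the text as the longitudinal match key.
MATCH_FIELDS = ("H-IDNUM", "H-MIS", "H-HHNUM", "LINENO")


def match_key(record):
    """Build the four-field match key for one person record."""
    return tuple(record[field] for field in MATCH_FIELDS)


def duplicate_keys(records):
    """Return each match key that occurs more than once, with its count."""
    counts = Counter(match_key(rec) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records; the identifier value here is a placeholder.
sample = [
    {"H-IDNUM": "000000000001", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "000000000001", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 2},
    {"H-IDNUM": "000000000001", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 2},
]
dupes = duplicate_keys(sample)
```

Any key reported in `dupes` identifies records that cannot be matched unambiguously on these fields alone and that need the additional checks mentioned above.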

Procedure

To match records from two consecutive months, divide the records in each file into four groups according to H-MIS. Group 1: H-MIS = 1 or 5, Group 2: H-MIS = 2 or 6, Group 3: H-MIS = 3 or 7, and Group 4: H-MIS = 4 or 8. Delete all Group 4 records from the Month 1 file and all Group 1 records from the Month 2 file. These records are represented in just one of the files, not both. You will be left with approximately three-quarters of your sample, the portion that is potentially represented on both files. Match the remaining records, group by group, on scrambled household identifier (H-IDNUM), household occupant identifier (H-HHNUM), and person line number (LINENO or PULINENO, depending on source). Match Month 1, Group 1 records to Month 2, Group 2 records; Month 1, Group 2 records to Month 2, Group 3 records; and Month 1, Group 3 records to Month 2, Group 4 records.
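The month-to-month steps can be sketched as follows. This is an illustration under stated assumptions, not the SAS program referenced on this page: records are assumed to be Python dicts keyed by the field names used in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def rotation_group(mis):
    """Map H-MIS 1-8 to Groups 1-4 (1 or 5 -> 1, ..., 4 or 8 -> 4)."""
    return (mis - 1) % 4 + 1


def naive_month_match(month1, month2):
    """Pair Month 1 records with Month 2 records one rotation group later."""
    index = defaultdict(list)
    for rec in month2:
        group = rotation_group(rec["H-MIS"])
        if group == 1:  # units entering or re-entering: not in Month 1
            continue
        key = (rec["H-IDNUM"], rec["H-HHNUM"], rec["LINENO"], group)
        index[key].append(rec)

    pairs = []
    for rec in month1:
        group = rotation_group(rec["H-MIS"])
        if group == 4:  # outgoing rotation groups: not in Month 2
            continue
        # Month 1, Group g is matched to Month 2, Group g + 1.
        key = (rec["H-IDNUM"], rec["H-HHNUM"], rec["LINENO"], group + 1)
        for candidate in index.get(key, []):
            pairs.append((rec, candidate))
    return pairs


# Made-up records: unit A continues, B rotates out, C has new occupants.
month1 = [
    {"H-IDNUM": "A", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "B", "H-MIS": 4, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "C", "H-MIS": 6, "H-HHNUM": 1, "LINENO": 1},
]
month2 = [
    {"H-IDNUM": "A", "H-MIS": 2, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "D", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "C", "H-MIS": 7, "H-HHNUM": 2, "LINENO": 1},
]
pairs = naive_month_match(month1, month2)
```

Keeping every candidate for a key, rather than only the first, makes duplicate keys visible as multiple pairs for the same Month 1 record.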

Alternatively, if you are matching records for the same month of consecutive years, then delete all records having H-MIS codes of 5-8 from the first year and all records having H-MIS codes of 1-4 from the second year. You will be left with approximately one-half of your original sample in each year. Match the remaining records on H-IDNUM, H-HHNUM, and LINENO within the same groups, i.e., Year 1, Group 1 to Year 2, Group 1; Year 1, Group 2 to Year 2, Group 2, etc.
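A comparable sketch for the year-to-year case is shown below, again assuming dict records keyed by the field names in the text and a hypothetical function name. Subtracting 4 from the Year 2 H-MIS places both records in the same group, as described above.

```python
def naive_year_match(year1, year2):
    """Pair Year 1 records (H-MIS 1-4) with Year 2 records (H-MIS 5-8)
    that share H-IDNUM, H-HHNUM, LINENO, and rotation group.

    Returns the Year 1 records eligible for matching and the matched pairs.
    """
    index = {}
    for rec in year2:
        if 5 <= rec["H-MIS"] <= 8:
            # H-MIS 5-8 in Year 2 correspond to Groups 1-4.
            key = (rec["H-IDNUM"], rec["H-HHNUM"], rec["LINENO"],
                   rec["H-MIS"] - 4)
            index.setdefault(key, []).append(rec)

    eligible = [rec for rec in year1 if 1 <= rec["H-MIS"] <= 4]
    pairs = []
    for rec in eligible:
        key = (rec["H-IDNUM"], rec["H-HHNUM"], rec["LINENO"], rec["H-MIS"])
        pairs.extend((rec, candidate) for candidate in index.get(key, []))
    return eligible, pairs


# Made-up records: one person in unit A is found again, one is not;
# unit B is not eligible; unit C has new occupants in Year 2.
year1 = [
    {"H-IDNUM": "A", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "A", "H-MIS": 1, "H-HHNUM": 1, "LINENO": 2},
    {"H-IDNUM": "B", "H-MIS": 6, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "C", "H-MIS": 3, "H-HHNUM": 1, "LINENO": 1},
]
year2 = [
    {"H-IDNUM": "A", "H-MIS": 5, "H-HHNUM": 1, "LINENO": 1},
    {"H-IDNUM": "C", "H-MIS": 7, "H-HHNUM": 2, "LINENO": 1},
]
eligible, pairs = naive_year_match(year1, year2)
```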

This match procedure will yield what National Bureau of Economic Research (NBER) researchers refer to as a "naive" match of records. The match will include some false positives and will miss other matches that cannot be made due to errors in the CPS identifiers. Anyone performing a longitudinal match of CPS respondents should refer to the excellent paper written by Brigitte C. Madrian and Lars John Lefgren, "A Note On Longitudinally Matching Current Population Survey (CPS) Respondents" at www.nber.org/data/cps_match.php. These authors discuss reasons for match failures, results to expect (a "naive" match rate of approximately 71% of all persons from year 1 who have potential matches in year 2), and provide information needed to screen the false positives from the naively matched set of records (using measures such as sex, race, age, and education).
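Screening a naively matched set might look something like the sketch below. The rule shown is only an illustration and is not the set of criteria from the Madrian and Lefgren paper: the field names SEX, RACE, and AGE are placeholders for whatever the files at hand call these variables, and the age tolerance is an arbitrary example for records one year apart.

```python
def is_plausible(earlier, later, min_age_change=-1, max_age_change=3):
    """Illustrative consistency check for one naively matched pair.

    SEX, RACE, and AGE are placeholder field names, and the default
    age limits are arbitrary example values.
    """
    if earlier["SEX"] != later["SEX"]:
        return False
    if earlier["RACE"] != later["RACE"]:
        return False
    age_change = later["AGE"] - earlier["AGE"]
    return min_age_change <= age_change <= max_age_change


def screen_pairs(pairs, **limits):
    """Split naive pairs into those kept and the number rejected."""
    kept = [pair for pair in pairs if is_plausible(*pair, **limits)]
    return kept, len(pairs) - len(kept)


# Made-up pairs: the second changes sex, the third ages implausibly.
naive_pairs = [
    ({"SEX": 1, "RACE": 1, "AGE": 34}, {"SEX": 1, "RACE": 1, "AGE": 35}),
    ({"SEX": 1, "RACE": 1, "AGE": 40}, {"SEX": 2, "RACE": 1, "AGE": 41}),
    ({"SEX": 2, "RACE": 2, "AGE": 20}, {"SEX": 2, "RACE": 2, "AGE": 52}),
]
kept, rejected = screen_pairs(naive_pairs)
```

Passing the limits as parameters keeps the tolerance an explicit choice; the paper should be consulted for how strict each criterion ought to be.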

Some Causes of Match Failures

The NBER researchers discuss four factors that significantly reduce the match rate: sample non-response, mortality, migration, and recording errors. Please refer to that paper for a detailed discussion of their findings.

Some normal events such as birth of a child, death, marriage, divorce, and children or other relatives or non-relatives moving in or out will change household composition. With all of these changes, all but one or perhaps a few records identified by a scrambled household identifier may be correctly matched, but one file may be missing one or more records for a given household. Take care with a match failure of a head of household or head of family. The absence of family or household head could cause processing problems later on. It may be necessary to delete entire households or families whose head has not been successfully matched.
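One way to drop households whose head was not matched is sketched below. It assumes dict records, a household defined by H-IDNUM plus H-HHNUM, and a hypothetical boolean field IS_HEAD that would have to be derived from the relationship code on the actual file; the function name is also made up.

```python
def drop_headless_households(records, matched_keys):
    """Keep only records from households whose head was matched.

    records      -- person records from one file
    matched_keys -- set of (H-IDNUM, H-HHNUM, LINENO) for matched persons
    IS_HEAD is a hypothetical flag, not an actual CPS field name.
    """
    households_with_head = set()
    for rec in records:
        person = (rec["H-IDNUM"], rec["H-HHNUM"], rec["LINENO"])
        if rec["IS_HEAD"] and person in matched_keys:
            households_with_head.add((rec["H-IDNUM"], rec["H-HHNUM"]))

    return [
        rec for rec in records
        if (rec["H-IDNUM"], rec["H-HHNUM"]) in households_with_head
    ]


# Made-up records: the head of household B was not matched.
records = [
    {"H-IDNUM": "A", "H-HHNUM": 1, "LINENO": 1, "IS_HEAD": True},
    {"H-IDNUM": "A", "H-HHNUM": 1, "LINENO": 2, "IS_HEAD": False},
    {"H-IDNUM": "B", "H-HHNUM": 1, "LINENO": 1, "IS_HEAD": True},
    {"H-IDNUM": "B", "H-HHNUM": 1, "LINENO": 2, "IS_HEAD": False},
]
matched = {("A", 1, 1), ("A", 1, 2), ("B", 1, 2)}
kept = drop_headless_households(records, matched)
```

The sketch removes whole households only; whether to also drop unmatched non-head members of the retained households is a separate decision.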

Click on the link below to obtain SAS code that has been used to match respondents from the March 2002 and 2003 CPS surveys. Note that the naive match rate for respondents from these two surveys using this code is just 62 percent, less than the 71 percent match rate reported by Madrian and Lefgren for survey years 1980-1998. One hypothesis for this discrepancy is that the increased CPS sample sizes in more recent years have led to changes in one or more of the four primary match fields, and that these changes lower the match rate.

SAS Code Used to Match Respondents from the March 2002 and March 2003 CPS